This is a bank marketing data set from Kaggle, 897.42 KB in size, with 17 columns and 11162 rows. It records bank customers’ features such as age, job, marital status, education level, housing status, account balance, and past marketing responses. By analyzing these customer characteristics, we can identify which types of customers tend to have higher account balances and determine the key features of those who are more likely to subscribe to the deposit service. This is useful for future marketing because it shows which groups of customers to target, so that the bank can allocate its marketing budget wisely and optimize its future product offerings.
The project will try to answer the following questions based on the dataset:
1. Which occupations have the highest and the lowest account balances?
2. What is the relationship between age, account balance and marital status?
3. How does account balance compare between customers with and without housing?
4. How does account balance compare between customers with and without a loan?
5. How does account balance differ across education levels?
6. How is customer age distributed?
7. What is the most preferred customer contact channel?
Techniques & methodology involved:
- Various sampling methods (Simple Random Sampling, Systematic Sampling, Stratified Sampling, Cluster Sampling)
- Central Limit Theorem
- Interquartile Rule
After checking the data for completeness, I find that the data set is clean, with no missing values, null values or duplicates.
The original data set has a job column (categorical) and a balance column (numerical). However, ranking jobs by total balance does not accurately show which occupation has the most or the least deposited, because the number of customers differs across occupations. Instead, we calculate the average account balance per occupation with an aggregate function: group by job, count the customers in each job, sum the balance for each job, and divide that total by the customer count. I plot the resulting ranking in the bar chart below, which answers question 1.
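The group, count, sum and divide steps can be sketched as follows. This is a minimal illustration on a made-up data frame called `toy` (a hypothetical stand-in, not the real bank data), and the result names `counts`, `totals` and `avg` are chosen only for this example.

```r
# Toy stand-in for the bank data (made-up values, not the real data set).
toy <- data.frame(
  job     = c("retired", "retired", "services", "services", "services", "student"),
  balance = c(3000, 1000, 500, 700, 300, 1200)
)

# Step 1: group by job and count the customers in each job.
counts <- aggregate(balance ~ job, data = toy, FUN = length)
# Step 2: sum the balance within each job.
totals <- aggregate(balance ~ job, data = toy, FUN = sum)
# Step 3: divide each job's total balance by its customer count.
avg <- data.frame(
  job         = as.character(totals$job),
  n_customers = counts$balance,
  avg_balance = totals$balance / counts$balance
)
# Rank jobs from the highest to the lowest average balance.
avg <- avg[order(avg$avg_balance, decreasing = TRUE), ]
print(avg)
```

Passing `FUN = mean` to a single `aggregate()` call would give the same averages; the three explicit steps are kept here only to mirror the description.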
Observation: Retired customers tend to have larger balances than the others, while customers in the services industry have the lowest balances.
Observation: Age and balance are not perfectly positively correlated. Older customers do not necessarily have higher balances. In this bank, most high-balance account owners are between about 30 and 60 years old, apart from a few outliers.
##   Marital_status Avg_balance
## 2        married    1599.928
## 3         single    1457.255
## 1       divorced    1371.835
Observation: On average, the married group has a higher balance than the single group, and the single group has a higher balance than the divorced group. Most of the high-balance accounts belong to married clients.
Observation: Customers without housing tend to have higher account balances, and so do customers without a loan.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -6847     122     550    1529    1708   81204
## [1] "As shown in the box plot, the balance variable has some outliers, which make it hard to compare balance across education levels. Therefore, I use the IQR method to detect outliers. Based on the quartiles from the summary function and the Q1±1.5*IQR equation, the lower and upper limits of the balance are -2257 and 2501. 1982 outliers are detected and dropped, leaving 9180 rows and 17 columns."
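For reference, the interquartile rule in its conventional (Tukey) form can be sketched as below. The vector `x` is made-up illustration data and every variable name is hypothetical. Note that the conventional rule places the upper fence at Q3 + 1.5*IQR; with the quartiles in the summary above (Q1 = 122, Q3 = 1708, IQR = 1586) that would be 1708 + 2379 = 4087, whereas the 2501 limit reported here corresponds to Q1 + 1.5*IQR.

```r
# Made-up values for illustration only (not the bank balances).
x <- c(-500, 100, 200, 300, 400, 500, 600, 700, 800, 9000)

# Quartiles and interquartile range (quantile() uses its default type 7).
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Conventional fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Flag values outside the fences and drop them.
is_outlier <- x < lower | x > upper
x_clean    <- x[!is_outlier]

cat("fences:", lower, upper, "\n")
cat("outliers removed:", sum(is_outlier), "\n")
```

On a data frame, the same logical vector could be used to filter rows, for example `df[!is_outlier, ]`, assuming `is_outlier` was computed from the balance column of that data frame.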
Observation: Across primary, secondary and tertiary education, the higher the customers’ education level, the larger their median balance. So people with a higher education level tend to keep more money in their accounts.
Contact is a categorical variable in the data set with the values “cellular”, “telephone” and “unknown”, indicating the best way to reach the customer.
Observation: Cellular is the most preferred customer contact channel.
Observation: The distribution of customer age is right-skewed. Most of the bank’s customers are between 30 and 40 years old. Age 60 is an obvious threshold: the number of customers drops sharply beyond it.
A variety of sampling techniques can help us select a representative portion of the population for analysis. In this course, we were taught simple random sampling, systematic sampling, stratified sampling and cluster sampling.
Simple random sampling is a basic sampling technique where individual subjects are selected from a larger group. In simple random sampling, every item in the frame has the same chance of being selected as every other item. Samples can be chosen with or without replacement. This data set has more than 10,000 records and we select only 1000 samples, so we sample without replacement.
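A minimal sketch of simple random sampling without replacement in base R is shown below. The `population` data frame is synthetic and its column names are hypothetical; only the sample size of 1000 is taken from the text.

```r
set.seed(1)  # arbitrary seed, for reproducibility of this sketch only

# Synthetic population of 10,000 records (made-up ages, not the bank data).
population <- data.frame(
  id  = 1:10000,
  age = sample(18:95, 10000, replace = TRUE)
)

# Draw 1000 distinct row indices; every row has the same selection chance.
idx <- sample(nrow(population), size = 1000, replace = FALSE)
srs <- population[idx, ]

cat("sample rows:", nrow(srs), "\n")
cat("sample mean age:", mean(srs$age), "\n")
```

Because `replace = FALSE`, no record can appear twice in `srs`.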
## [1] "Original data: Mean = 41.2319476796273 SD = 11.9133691922155"
## Sample Size = 10 Mean = 41.2592 SD = 3.825063
## Sample Size = 20 Mean = 41.22335 SD = 2.679972
## Sample Size = 30 Mean = 41.19713 SD = 2.166738
## Sample Size = 40 Mean = 41.1688 SD = 1.859213
Observation: Although customer age is not normally distributed, its sample means take the shape of a normal distribution when the sample size is large. Furthermore, the means of the sample means are close to the mean of the original data. The larger the sample size, the lower the standard deviation and the narrower the spread of the sample means.
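The behaviour described here can be reproduced with a small simulation. The sketch below is illustrative only: the right-skewed `population` is generated from an exponential distribution rather than taken from the bank data, and the number of repeated draws (`n_draws = 1000`) is an assumption. For independent draws, the standard deviation of the sample mean is roughly the population standard deviation divided by the square root of the sample size, which is consistent with the output above (11.91 / sqrt(10) is about 3.77).

```r
set.seed(1)  # arbitrary seed for this sketch

# Synthetic right-skewed population (assumption: exponential, mean about 40).
population <- rexp(10000, rate = 1 / 40)

sizes   <- c(10, 20, 30, 40)  # sample sizes, as in the output above
n_draws <- 1000               # assumed number of repeated samples per size

# For each sample size, draw many samples and record each sample's mean.
sample_means <- lapply(sizes, function(n) {
  replicate(n_draws, mean(sample(population, n)))
})
names(sample_means) <- paste0("n=", sizes)

# Mean and standard deviation of the sample means for each size.
means <- sapply(sample_means, mean)
sds   <- sapply(sample_means, sd)

cat("Population: Mean =", mean(population), "SD =", sd(population), "\n")
for (i in seq_along(sizes)) {
  cat("Sample Size =", sizes[i], "Mean =", means[i], "SD =", sds[i], "\n")
}
```

Plotting `hist(sample_means[["n=40"]])` would show the roughly bell-shaped distribution of sample means, even though the population itself is skewed.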
In systematic sampling, items are selected from the population at a fixed, periodic interval, beginning from a random starting point. This interval, called the sampling interval, is calculated by dividing the population size by the desired sample size. Here, the sample size is 500.
set.seed(6795)
N <- nrow(df)       # population size
n <- 500            # desired sample size
k <- ceiling(N/n)   # sampling interval
r <- sample(k, 1)   # random starting point between 1 and k
# select every kth item, starting at r
s <- seq(r, by = k, length.out = n)
df_sys <- df[s, ]
# indices beyond N yield all-NA rows; drop them
df_sys <- df_sys[complete.cases(df_sys), ]
With these sampling methods, we can draw representative subgroups for more granular visualizations that are not feasible with the whole data set, such as the bubble plot below. It gives insight into the number of days between targeted marketing and deposit subscription for accounts with different features.
In stratified sampling, the population is subdivided into subpopulations called strata. The number of samples selected from each stratum is proportional to the size of that stratum relative to the entire data set.
library(sampling)
library(UsingR)
library(prob)
set.seed(6795)
freq <- table(df$job)
# sort by job so the strata appear in the same order as the sizes below
df_o <- df[order(df$job),]
# proportional allocation of 500 samples across the job strata
st.sizes <- 500 * freq / sum(freq)
st.sizes <- as.vector(st.sizes)
st.sizes <- st.sizes[st.sizes != 0]
st.sizes <- round(st.sizes)
st <- sampling::strata(df_o, stratanames = c("job"), size = st.sizes, method = "srswr", description = FALSE)
# extract rows from the same ordered data frame that strata() indexed
df_str <- getdata(df_o, st)
In cluster sampling, the sampling unit is the whole cluster: instead of sampling individuals from within each group, a researcher studies whole clusters.
library(prob)
set.seed(6795)
# draw 5 job clusters (with replacement) and keep every customer in them
cl <- sampling::cluster(df, c("job"),
size = 5, method="srswr")
df_cl <- getdata(df, cl)
1. Difference in the balance mean
## [1] "The average balance from original dataset is 1528.53852356209"
## [1] "The average balance from systematic sampling is 1187.23298969072"
## [1] "The average balance from stratified sampling is 1532.806"
## [1] "The average balance from cluster sampling is 1402.88148148148"
2. Difference in the job structure
Observation: The four pie charts show that the job proportions are the same in the original data and in the stratified sample, because the number of samples selected from each stratum is proportional to the size of that stratum relative to the entire data set.
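The reason proportional allocation preserves the job structure can be checked numerically. The sketch below uses a synthetic `pop` data frame with made-up job counts and base R only, sampling without replacement within each stratum, so it is an illustration of the idea rather than the `sampling::strata` call used in this report.

```r
set.seed(1)  # arbitrary seed for this sketch

# Synthetic population with a known job structure (made-up counts).
pop <- data.frame(
  job = rep(c("admin", "retired", "services"), times = c(500, 300, 200))
)

# Proportional allocation: each stratum's share of a sample of 100.
n     <- 100
tab   <- table(pop$job)
sizes <- setNames(as.integer(round(n * tab / nrow(pop))), names(tab))

# Sample the allocated number of rows within each job stratum.
idx <- unlist(lapply(names(sizes), function(j) {
  rows <- which(pop$job == j)
  sample(rows, sizes[[j]])
}))
strat <- pop[idx, , drop = FALSE]

# Compare job proportions in the population and in the sample.
pop_prop   <- prop.table(table(pop$job))
strat_prop <- prop.table(table(strat$job))
print(rbind(population = pop_prop, stratified = strat_prop))
```

When the allocated sizes need rounding, the two sets of proportions agree only approximately rather than exactly.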
Throughout the analysis, we have answered the questions in the project objectives.
1. Retired customers have the highest account balances, and those in the services industry have the lowest.
2. Age and balance are not perfectly positively related. On average, the married group has a higher balance than the single group, and the single group has a higher balance than the divorced group.
3. Customers without housing tend to have higher account balances.
4. Customers without a loan tend to have higher account balances.
5. People with a higher education level tend to keep more money in their accounts.
6. The distribution of customer age is right-skewed. Most of the bank’s customers are between 30 and 40 years old.
7. The most preferred customer contact channel is cellular.
Future improvement
We could apply machine learning techniques to this data set to make better predictions for future marketing campaigns.